Local correlation clustering

Authors

  • Francesco Bonchi
  • David García-Soriano
  • Konstantin Kutzkov
Abstract

Correlation clustering is perhaps the most natural formulation of clustering. Given n objects and a pairwise similarity measure, the goal is to cluster the objects so that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters. Despite its theoretical appeal, the practical relevance of correlation clustering remains largely unexplored. This is mainly because correlation clustering requires the Θ(n²) pairwise similarities as input, which for large datasets are infeasible to compute or even just to store. In this paper we initiate the investigation into local algorithms for correlation clustering, laying the theoretical foundations for clustering “big data”. In local correlation clustering we are given the identifier of a single object and we want to return the cluster to which it belongs in some globally consistent near-optimal clustering, using a small number of similarity queries. Local algorithms for correlation clustering open the door to sublinear-time algorithms, which are particularly useful when the similarity between items is costly to compute, as is often the case in practical application domains. They also imply (i) distributed and streaming clustering algorithms, (ii) constant-time estimators and testers for cluster edit distance, and (iii) property-preserving parallel reconstruction algorithms for clusterability. Specifically, we devise a local clustering algorithm attaining a (3, ε)-approximation (a solution with cost at most 3 · OPT + εn, where OPT is the optimal cost). Its running time is O(1/ε²), independent of the dataset size. If desired, an explicit approximate clustering for all n objects can be produced in time O(n/ε) (which is provably optimal). We also provide a fully additive (1, ε)-approximation with local query complexity poly(1/ε) and time complexity 2^poly(1/ε). The explicit clustering can be found in time n · poly(1/ε) + 2^poly(1/ε). The latter yields the fastest polynomial-time approximation scheme for correlation clustering known to date.
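
To make the objective in the abstract concrete, the following minimal Python sketch counts the cost of a clustering (the number of pairs on which it disagrees with the similarity measure) and builds a clustering with the classic randomized pivot heuristic. This is not the local algorithm of the paper, whose details the abstract does not give; the pivot heuristic is swapped in purely for illustration. The names `disagreements`, `pivot_clustering`, and the boolean `similar(u, v)` oracle are assumptions introduced here.

```python
import random
from itertools import combinations


def disagreements(items, similar, label):
    """Count pairs on which the clustering disagrees with the similarity.

    A pair is a disagreement if it is similar but split across clusters,
    or dissimilar but placed in the same cluster.
    """
    cost = 0
    for u, v in combinations(items, 2):
        same_cluster = label[u] == label[v]
        if similar(u, v) != same_cluster:
            cost += 1
    return cost


def pivot_clustering(items, similar, seed=0):
    """Classic randomized pivot heuristic (illustrative, not the paper's method).

    Visit items in random order; each still-unclustered item becomes a pivot
    and absorbs every unclustered item that is similar to it.
    """
    rng = random.Random(seed)
    order = list(items)
    rng.shuffle(order)
    label = {}
    for pivot in order:
        if pivot in label:
            continue  # already absorbed by an earlier pivot
        label[pivot] = pivot
        for v in order:
            if v not in label and similar(pivot, v):
                label[v] = pivot
    return label


# Toy instance (hypothetical): two perfectly consistent groups {0,1,2} and {3,4,5}.
items = range(6)
similar = lambda u, v: u // 3 == v // 3
label = pivot_clustering(items, similar)
print(disagreements(items, similar, label))  # -> 0
```

Note that `disagreements` inspects all Θ(n²) pairs, which is exactly the cost the abstract calls infeasible for large datasets; the point of a local algorithm is to answer "which cluster does this object belong to?" with far fewer similarity queries.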

Similar resources

Entropy-based Consensus for Distributed Data Clustering

The increasingly larger scale of available data and the more restrictive concerns on their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...

Bounding and Comparing Methods for Correlation Clustering Beyond ILP

We evaluate several heuristic solvers for correlation clustering, the NP-hard problem of partitioning a dataset given pairwise affinities between all points. We experiment on two practical tasks, document clustering and chat disentanglement, to which ILP does not scale. On these datasets, we show that the clustering objective often, but not always, correlates with external metrics, and that loc...

Learning Mixtures of Multi-Output Regression Models by Correlation Clustering for Multi-View Data

In many datasets, different parts of the data may have their own patterns of correlation, a structure that can be modeled as a mixture of local linear correlation models. The task of finding these mixtures is known as correlation clustering. In this work, we propose a linear correlation clustering method for datasets whose features are pre-divided into two views. The method, called Canonical Le...

Correlation Clustering for Learning Mixtures of Canonical Correlation Models

This paper addresses the task of analyzing the correlation between two related domains X and Y. Our research is motivated by an Earth Science task that studies the relationship between vegetation and precipitation. A standard statistical technique for such problems is Canonical Correlation Analysis (CCA). A critical limitation of CCA is that it can only detect linear correlation between the tw...

SLICE: A Novel Method to Find Local Linear Correlations by Constructing Hyperplanes

Finding linear correlations in a dataset is an important data mining task with wide real-world application. Existing correlation clustering methods combine clustering with PCA to find correlation clusters in a dataset. These methods may miss some correlations when instances are sparsely distributed. Previous studies are limited to finding the primary linear correlation of the dataset. Ho...

Journal:
  • CoRR

Volume abs/1312.5105

Publication date: 2013